LiveCap: Real-time Human Performance Capture from Monocular Video
We present the first real-time human performance capture approach that
reconstructs dense, space-time coherent deforming geometry of entire humans in
general everyday clothing from just a single RGB video. We propose a novel
two-stage analysis-by-synthesis optimization whose formulation and
implementation are designed for high performance. In the first stage, a skinned
template model is jointly fitted to background subtracted input video, 2D and
3D skeleton joint positions found using a deep neural network, and a set of
sparse facial landmark detections. In the second stage, dense non-rigid 3D
deformations of skin and even loose apparel are captured based on a novel
real-time capable algorithm for non-rigid tracking using dense photometric and
silhouette constraints. Our novel energy formulation leverages automatically
identified material regions on the template to model the differing non-rigid
deformation behavior of skin and apparel. The two resulting non-linear
optimization problems per-frame are solved with specially-tailored
data-parallel Gauss-Newton solvers. In order to achieve real-time performance
of over 25Hz, we design a pipelined parallel architecture using the CPU and two
commodity GPUs. Our method is the first real-time monocular approach for
full-body performance capture. Our method yields accuracy comparable to
off-line performance capture techniques, while being orders of magnitude
faster.
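
The abstract above says that each per-frame non-linear optimization problem is solved with a Gauss-Newton solver. As a rough illustration of the Gauss-Newton iteration itself, and not of the authors' data-parallel GPU implementation, the sketch below solves a small toy problem: locating a 2D point from its distances to known anchors. The function name, the toy problem, and all values are assumptions made for illustration only.

```python
import math


def gauss_newton_localize(anchors, distances, start, iterations=10):
    """Estimate a 2D point from measured distances to known anchor points.

    Residual per anchor: r_i = ||p - a_i|| - d_i.
    Assumes the estimate never coincides exactly with an anchor.
    """
    x, y = start
    for _ in range(iterations):
        # Accumulate the 2x2 normal equations (J^T J) step = -(J^T r).
        a11 = a12 = a22 = 0.0
        g1 = g2 = 0.0
        for (ax, ay), d in zip(anchors, distances):
            dx, dy = x - ax, y - ay
            norm = math.hypot(dx, dy)
            r = norm - d                   # residual for this anchor
            jx, jy = dx / norm, dy / norm  # Jacobian row for this anchor
            a11 += jx * jx
            a12 += jx * jy
            a22 += jy * jy
            g1 += jx * r
            g2 += jy * r
        # Solve the 2x2 system in closed form and apply the update.
        det = a11 * a22 - a12 * a12
        x += -(a22 * g1 - a12 * g2) / det
        y += -(a11 * g2 - a12 * g1) / det
    return x, y


# Hypothetical toy data: four anchors and exact distances to the point (1, 2).
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
true_point = (1.0, 2.0)
distances = [math.dist(true_point, a) for a in anchors]
estimate = gauss_newton_localize(anchors, distances, start=(2.0, 2.0))
```

Each residual contributes independently to the sums in the inner loop, which is the kind of per-term accumulation a data-parallel solver could spread across threads; how LiveCap actually does this is not described in the abstract.
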
SceNeRFlow: Time-Consistent Reconstruction of General Dynamic Scenes
Existing methods for the 4D reconstruction of general, non-rigidly deforming
objects focus on novel-view synthesis and neglect correspondences. However,
time consistency enables advanced downstream tasks like 3D editing, motion
analysis, or virtual-asset creation. We propose SceNeRFlow to reconstruct a
general, non-rigid scene in a time-consistent manner. Our dynamic-NeRF method
takes multi-view RGB videos and background images from static cameras with
known camera parameters as input. It then reconstructs the deformations of an
estimated canonical model of the geometry and appearance in an online fashion.
Since this canonical model is time-invariant, we obtain correspondences even
for long-term, long-range motions. We employ neural scene representations to
parametrize the components of our method. Like prior dynamic-NeRF methods, we
use a backwards deformation model. We find non-trivial adaptations of this
model necessary to handle larger motions: We decompose the deformations into a
strongly regularized coarse component and a weakly regularized fine component,
where the coarse component also extends the deformation field into the space
surrounding the object, which enables tracking over time. We show
experimentally that, unlike prior work that only handles small motion, our
method enables the reconstruction of studio-scale motions.
Comment: Project page: https://vcai.mpi-inf.mpg.de/projects/scenerflow
iSDF: Real-Time Neural Signed Distance Fields for Robot Perception
We present iSDF, a continual learning system for real-time signed distance
field (SDF) reconstruction. Given a stream of posed depth images from a moving
camera, it trains a randomly initialised neural network to map an input 3D
coordinate to an approximate signed distance. The model is self-supervised by
minimising a loss that bounds the predicted signed distance using the distance
to the closest sampled point in a batch of query points that are actively
sampled. In contrast to prior work based on voxel grids, our neural method is
able to provide adaptive levels of detail with plausible filling in of
partially observed regions and denoising of observations, all while having a
more compact representation. In evaluations against alternative methods on real
and synthetic datasets of indoor environments, we find that iSDF produces more
accurate reconstructions, and better approximations of collision costs and
gradients useful for downstream planners in domains from navigation to
manipulation. Code and video results can be found at our project page:
https://joeaortiz.github.io/iSDF/.
Comment: Project page: https://joeaortiz.github.io/iSDF
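
The iSDF abstract describes a loss that bounds the predicted signed distance by the distance to the closest sampled point in a batch. The sketch below illustrates that idea for a free-space query point: the distance to the nearest sampled surface point can never be smaller than the true distance to the surface, so it serves as an upper bound. The function names and the hinge-style penalty are assumptions for illustration; the abstract does not give the exact form of the loss.

```python
import math


def closest_point_bound(query, surface_points):
    """Upper bound on the distance from `query` to the surface:
    the distance to the nearest sampled surface point."""
    return min(math.dist(query, p) for p in surface_points)


def bound_penalty(predicted_sdf, bound):
    """Hypothetical penalty: zero while the prediction respects the bound,
    growing linearly once the prediction exceeds it."""
    return max(0.0, predicted_sdf - bound)


# Hypothetical batch: three surface samples on the plane z = 1
# and one free-space query point below it.
surface_points = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
query = (0.0, 0.0, 0.4)

bound = closest_point_bound(query, surface_points)
penalty_ok = bound_penalty(0.5, bound)     # prediction within the bound
penalty_over = bound_penalty(0.75, bound)  # prediction exceeds the bound
```

With denser surface samples the bound tightens toward the true distance, which is one reason a per-batch nearest-point search is a plausible supervision signal when no ground-truth SDF is available.
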
HyperReel: High-Fidelity 6-DoF Video with Ray-Conditioned Sampling
Volumetric scene representations enable photorealistic view synthesis for
static scenes and form the basis of several existing 6-DoF video techniques.
However, the volume rendering procedures that drive these representations
necessitate careful trade-offs in terms of quality, rendering speed, and memory
efficiency. In particular, existing methods fail to simultaneously achieve
real-time performance, small memory footprint, and high-quality rendering for
challenging real-world scenes. To address these issues, we present HyperReel --
a novel 6-DoF video representation. The two core components of HyperReel are:
(1) a ray-conditioned sample prediction network that enables high-fidelity,
high frame rate rendering at high resolutions and (2) a compact and
memory-efficient dynamic volume representation. Our 6-DoF video pipeline
achieves the best performance compared to prior and contemporary approaches in
terms of visual quality with small memory requirements, while also rendering at
up to 18 frames-per-second at megapixel resolution without any custom CUDA
code.
Comment: Project page: https://hyperreel.github.io
Neural 3D Video Synthesis
We propose a novel approach for 3D video synthesis that is able to represent
multi-view video recordings of a dynamic real-world scene in a compact, yet
expressive representation that enables high-quality view synthesis and motion
interpolation. Our approach takes the high quality and compactness of static
neural radiance fields in a new direction: to a model-free, dynamic setting. At
the core of our approach is a novel time-conditioned neural radiance field
that represents scene dynamics using a set of compact latent codes. To exploit
the fact that changes between adjacent frames of a video are typically small
and locally consistent, we propose two novel strategies for efficient training
of our neural network: 1) An efficient hierarchical training scheme, and 2) an
importance sampling strategy that selects the next rays for training based on
the temporal variation of the input videos. In combination, these two
strategies significantly boost the training speed, lead to fast convergence of
the training process, and enable high quality results. Our learned
representation is highly compact and able to represent a 10 second 30 FPS
multi-view video recording by 18 cameras with a model size of just 28MB. We
demonstrate that our method can render high-fidelity wide-angle novel views at
over 1K resolution, even for highly complex and dynamic scenes. We perform an
extensive qualitative and quantitative evaluation that shows that our approach
outperforms the current state of the art. We include additional video and
information at: https://neural-3d-video.github.io/
Comment: Project website: https://neural-3d-video.github.io
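
The abstract above mentions an importance sampling strategy that selects training rays based on the temporal variation of the input videos. A minimal sketch of that idea follows, assuming each pixel corresponds to one ray and using the mean absolute deviation over time as the variation measure. Both the measure and the function names are assumptions for illustration; the abstract does not specify the weighting the authors use.

```python
import random


def temporal_variation_weights(frames, floor=1e-3):
    """Per-pixel sampling weights from a list of frames.

    Each frame is a flat list of pixel intensities. The weight of a pixel
    is its mean absolute deviation over time plus a small floor, so that
    static pixels still have a non-zero chance of being sampled.
    """
    num_pixels = len(frames[0])
    weights = []
    for i in range(num_pixels):
        values = [frame[i] for frame in frames]
        mean = sum(values) / len(values)
        variation = sum(abs(v - mean) for v in values) / len(values)
        weights.append(variation + floor)
    return weights


def sample_ray_indices(weights, count, rng):
    """Draw `count` pixel (ray) indices with probability proportional to weight."""
    return rng.choices(range(len(weights)), weights=weights, k=count)


# Hypothetical clip: three frames of four pixels; only pixel 2 changes.
frames = [
    [0.2, 0.5, 0.0, 0.9],
    [0.2, 0.5, 1.0, 0.9],
    [0.2, 0.5, 0.0, 0.9],
]
weights = temporal_variation_weights(frames)
picks = sample_ray_indices(weights, count=1000, rng=random.Random(0))
```

In this toy clip nearly all sampled rays fall on the one changing pixel, which reflects the stated motivation: changes between adjacent frames are typically small and local, so training effort is concentrated where the video actually varies.
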